FedADT: An Adaptive Method Based on Derivative Term for Federated Learning

Federated learning serves as a novel distributed training framework that enables multiple clients in the Internet of Things to collaboratively train a global model while the data remain local. However, implementing federated learning faces many problems in practice, such as the large number of training rounds required for convergence owing to the size of the model, and the lack of adaptivity of stochastic gradient-based updates at the client side. Moreover, the optimization process is sensitive to noise, which can degrade the performance of the final model. For these reasons, we propose Federated Adaptive learning based on Derivative Term, called FedADT, which incorporates an adaptive step size and the difference of gradients in the update of the local model. To further reduce the influence of noise on the derivative term, which is estimated by the difference of gradients, we apply a moving average decay to the derivative term. Moreover, we analyze the convergence of the proposed algorithm for non-convex objective functions: a convergence rate of $O(1/\sqrt{nT})$ can be achieved by choosing appropriate hyper-parameters, where n is the number of clients and T is the number of iterations. Finally, various experiments on the image classification task are conducted by training a widely used convolutional neural network on the MNIST and Fashion MNIST datasets to verify the effectiveness of FedADT. In addition, the receiver operating characteristic curve is used to display the results of the proposed algorithm in predicting the categories of clothing on the Fashion MNIST dataset.


Introduction
Recently, vast amounts of data have been generated by decentralized network edge devices, such as mobile phones and smart devices in the Internet of Things. Collecting and transmitting these data for training not only gives rise to network congestion but also causes privacy leakage. For this reason, Federated Learning (FL) frameworks were proposed in [1,2], where clients learn a shared global model based on their own private data under the coordination of a central server. As data holders, clients conduct multi-step model training locally on the basis of the most recently received global model; the central server then aggregates these local models into a global model and returns it to each client. This manner of alternating training and communication was implemented by McMahan et al. [2], resulting in the Federated Averaging (FedAvg) algorithm, which is one of the most popular methods in federated learning. In FedAvg, each client trains a model using the Stochastic Gradient Descent (SGD) method. Owing to these advantages, FL is broadly used in many application scenarios [3][4][5].
Despite the good empirical performance of FedAvg, also known as local SGD [6], gaps remain between its theory and practice. To better understand its convergence in theory, several studies [7][8][9] have analyzed it under the data-homogeneous setting. For data or client heterogeneity, Ref. [10] introduced a regularization term into the local objective function to address the non-identically distributed challenge. Control variate and variance reduction methods [11] were proposed to correct the bias across clients, which otherwise leads to unstable and slow convergence. Most of these methods require extra communication or memory, which can be costly and impractical in the federated setting. In addition, momentum-based methods have been introduced into local model updates [12], global model updates [13,14], or both [15,16] to improve the stability of convergence. The convergence of SGD-type methods is highly sensitive to the learning rate, or step size, which controls the speed of model learning; hence, another line of studies modifies the learning rate by scaling the gradient in each dimension using prior information. These methods [14,17,18], which use an adaptive learning rate, also adopt a momentum term that accumulates previous information. However, the accumulated past gradient information can cause an improper model update, which may sometimes be opposite to the descent direction. Training then lags, and this lag effect can lead to oscillation, in which the learning curve fluctuates around the optimal point; this is known as the overshoot problem in control.
In optimization algorithms for machine learning, especially deep learning, the stochastic gradient can be regarded as an error that the optimization aims to drive gradually to zero. This is similar in spirit to the Proportional-Integral-Derivative (PID) controller in control theory and engineering. The main idea of PID-based control is to incorporate current, past, and future information into the current correction, adjusting the input of a dynamic system so that it performs as desired. The use of a feedback mechanism makes the control process more responsive and robust [19,20]. The studies [21][22][23] reveal that the error or deviation in the PID controller plays a role similar to that of the stochastic gradient in SGD-based methods. In addition, SGD with momentum mainly utilizes current and historical gradients to optimize the model, which can be interpreted as the proportional and integral parts. This inspires us to introduce derivative information, which reflects the future trend of the gradient change, into the local update of federated learning. Based on this analysis, we consider the descent direction and the learning rate simultaneously in the update rule of the model parameters to better train the federated learning model. To this end, this paper integrates the PID controller into local SGD for federated learning. In a nutshell, the contributions are as follows:
• We incorporate the adaptive learning rate and the derivative term in the update of the local model at the client side and propose a new federated optimization approach called FedADT.
• We rigorously prove that the proposed algorithm achieves an $O(1/\sqrt{nT})$ convergence rate for non-convex smooth objective functions, where n is the number of clients and T is the number of iterations.
• We conduct experiments on the image classification task on two datasets. The experimental results verify the effectiveness of the proposed algorithm.
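For reference, the textbook discrete-time PID control law, and the correspondence with gradient-based optimization that the paragraph above appeals to, can be sketched as follows. The gains $K_p$, $K_i$, $K_d$ and the error signal $e_t$ are generic control-theory symbols rather than notation from this paper; identifying $e_t$ with the stochastic gradient $g_t$ is the analogy drawn in [21][22][23].

```latex
% Discrete-time PID control law (generic textbook form)
u_t = K_p\, e_t \;+\; K_i \sum_{k=0}^{t} e_k \;+\; K_d\,(e_t - e_{t-1})

% Analogy with optimization, taking e_t <-> g_t:
%   proportional term  K_p e_t             <-> current gradient g_t
%   integral term      K_i \sum_k e_k      <-> accumulated past gradients (momentum)
%   derivative term    K_d (e_t - e_{t-1}) <-> gradient difference g_t - g_{t-1}
```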
The remainder is organized as follows. The related work is summarized in Section 2. In Section 3, the optimization problem is first introduced. Then, the proposed federated learning algorithm is described in detail in Section 4. In Section 5, we present related assumptions and the main results. The experiments are performed to validate the theoretical results in Section 6. Finally, we conclude the paper in Section 7.

Related Work
SGD is perhaps the most popular optimization method in machine learning, offering good empirical performance together with robustness and scalability. Momentum is a heuristic but effective way to accelerate the convergence of SGD. Motivated by the heavy ball method [24] and Nesterov's accelerated gradient method [25], a momentum term, formed as a weighted sum of previous information, is usually added to the current descent direction to improve the convergence of SGD [26]. Sutskever et al. [27] successfully combined SGD with a careful use of momentum in the training of deep neural networks. The article [28] analyzed the final iterate of SGD under standard step size schedules and obtained lower bounds on its sub-optimality. The generalization performance of SGD versus full gradient descent was studied in [29], which presented a novel separation result in the stochastic convex optimization model. Additionally, adaptive optimization methods and their variants [30][31][32] have been widely adopted in deep learning because of their success in practice. Reddi et al. [32] proposed an Adaptive Mean Square Gradient method (AMSGrad) to fix the convergence issues of the adaptive moment estimation method (Adam). Zaheer et al. [33] exploited the effect of the mini-batch size to improve the performance of Adam. AdaBelief [34], a variant that adapts step sizes according to the belief in the current gradients, achieves better convergence, generalization, and training stability in both convex and non-convex cases by modifying Adam without additional parameters. Besides methods whose step size adjusts to the scaling of the gradients, a new class of adaptive methods [35][36][37] based on the Polyak step size has emerged, utilizing both the current loss value and the stochastic gradient. Most of these studies focus only on the online convex optimization case or require a projection operation onto a bounded domain.
Recently, the connection between PID control and stochastic optimization was described in [23], which shows that a PID-based method is an optimizer that encapsulates both gradient and momentum terms. An et al. [21] proposed a novel PID optimizer, which introduces a derivative action to reduce the oscillation phenomenon, known as overshoot in the control field. The above algorithms are implemented in a centralized setting.
Distributed optimization based on parallel SGD has been developed over the past decade, but it often suffers from bandwidth limits and large network delays. To alleviate communication bottlenecks, local SGD with periodic model averaging results in the FedAvg algorithm [2], which significantly reduces the communication overhead. Along this line of research, much work explores the theoretical convergence of FedAvg and improves its performance. Stich [6] first established an upper bound for FedAvg in the convex homogeneous setting when all clients participate in each round, which was later improved by [8] in the convex heterogeneous setting. The work [38] established a lower bound for FedAvg in the heterogeneous case. Moreover, a unified framework [39] was presented to analyze local SGD methods in convex and strongly convex settings. A hybrid local SGD method [40] was proposed to speed up the training of federated learning. The studies mentioned above use SGD as the local optimizer. Other variants incorporate momentum [9] and adaptive techniques [14,18]. FedAdam [14] utilizes the Adam algorithm as a local optimizer in the federated learning framework to overcome the difficulty of parameter tuning in non-convex settings. Local AMSGrad [17] was designed to accelerate training and reduce communication overhead. Additionally, PID-based federated optimization methods have been developed recently. Ref. [41] designed a privacy budget allocation protocol that computes PID errors to balance the privacy guarantee and the utility of the global model. The article [42] combined a federated learning framework with a PID controller to support the deployment of future intelligent transportation systems. Inspired by these works, in this paper we apply an adaptive learning rate and a derivative term in the federated setting and analyze the convergence of the resulting method.

Problem Formulation
Notation. Throughout the paper, we use $x_t^i$ to denote the model parameter of the i-th client at the t-th iteration. Let $\|\cdot\|$ and $\|\cdot\|_\infty$ denote the $\ell_2$ and $\ell_\infty$ vector norms, respectively, and let $(\cdot)_j$ denote the j-th coordinate of a vector. Vector squares and vector divisions are element-wise.
In this paper, we consider a general federated learning system, as shown in Figure 1. It contains a central server and n clients or devices, for example, smartphones or industrial sensors. By collecting a large amount of production data, such as temperature, pressure, and current, federated learning can jointly model data from multiple plants without sharing trade secrets, thus improving productivity, quality, and safety. As illustrated in Figure 1, the training process of federated learning can be briefly summarized as follows: at each round, the central server first selects a subset of edge clients, and each selected client downloads the global model. Then, each client in the subset performs multiple steps of local training on its raw dataset and obtains a local model. Finally, the local models are uploaded to and aggregated by the central server. The above process is repeated until the global model converges or an expected prediction accuracy is attained. This model training can be formulated as an optimization problem. The main goal is to find a global model parameter, denoted by the vector $x \in \mathbb{R}^d$, that solves:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where d is the dimension of the model parameter, $f_i(x) = \mathbb{E}_{\xi_i \sim D_i} F(x; \xi_i)$ is the local expected loss function of the i-th client, $F(x; \xi_i)$ denotes the loss of the model parameter x on one example $\xi_i$ stored in the i-th client, and $D_i$ represents the data distribution of the i-th client, $i \in \{1, \ldots, n\}$. The data distributions of different clients may differ.
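The training process described above can be sketched in a few lines. This is a toy illustration and not the paper's implementation: each client holds a hypothetical scalar loss 0.5 * (x - c_i)^2 standing in for $f_i$, every client participates in every round, and the names `local_sgd` and `federated_round` are made up for this example.

```python
def local_sgd(x, target, steps, lr):
    """Run `steps` gradient steps on the toy local loss 0.5 * (x - target)**2."""
    for _ in range(steps):
        grad = x - target      # exact gradient of the toy loss
        x = x - lr * grad
    return x

def federated_round(x_global, targets, steps, lr):
    """One round: clients start from x_global, train locally, server averages."""
    local_models = [local_sgd(x_global, c, steps, lr) for c in targets]
    return sum(local_models) / len(local_models)

targets = [1.0, 2.0, 6.0]      # hypothetical per-client optima (heterogeneous data)
x = 0.0
for _ in range(50):            # 50 communication rounds, 5 local steps each
    x = federated_round(x, targets, steps=5, lr=0.1)
```

Because the toy losses share the same curvature, the averaged model approaches the minimizer of the average loss, which here is the mean of the client optima. With real, differently shaped local losses this coincidence does not hold in general, which is one source of the client bias discussed in the introduction.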

Algorithm Design
In this article, we are concerned with the collaborative learning of n clients under the coordination of a central server to solve problem (1) by local training and periodic model aggregation. To stabilize local model training, we add to the update rule of the local model parameter a derivative term, which captures the trend of the gradient change, and an adaptive learning rate. The pseudo-code of the proposed method, FedADT, is summarized in Algorithm 1. Specifically, at the beginning of the (t + 1)-th iteration, the central server first randomly selects a subset of clients. Each client i involved in the current training computes the stochastic gradient $g_t^i$, an unbiased estimator of the full gradient $\nabla f_i(x_t^i)$, using a random mini-batch from its own dataset. It then computes an exponentially weighted average momentum term $m_{t+1}^i$ as the descent direction of the model update and a second-order moment $v_{t+1}^i$ to adaptively adjust the learning rate, defined as follows:
$$m_{t+1}^i = \beta_1 m_t^i + (1 - \beta_1)\, g_t^i, \qquad v_{t+1}^i = \beta_2 v_t^i + (1 - \beta_2)\, (g_t^i)^2,$$
where $\beta_1, \beta_2 \in [0, 1)$ are decay factors that control the exponential decay rates of the weighted averages, and $(g_t^i)^2$ denotes the element-wise square. In fact, $m_{t+1}^i$ can be expressed as
$$m_{t+1}^i = (1 - \beta_1) \sum_{k=0}^{t} \beta_1^{\,t-k}\, g_k^i,$$
where the initial value of the momentum is set to 0. The decay factor $\beta_1$ is usually chosen so that the exponentially weighted average assigns small weights to gradients far from the current iteration. A similar choice applies to the decay factor $\beta_2$, which is selected from the set {0.99, 0.9999} in the relevant papers [31,32].
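The moment updates described above can be illustrated with a small sketch. It assumes the standard Adam-style recursions (first moment: beta1 * m + (1 - beta1) * g; second moment: beta2 * v + (1 - beta2) * g * g, element-wise) with zero initial values, and checks numerically that the recursive momentum equals its unrolled weighted sum. The function names and the sample gradients are hypothetical.

```python
def update_moments(m, v, g, beta1=0.9, beta2=0.99):
    """One exponentially weighted update of the first and second moments."""
    m_new = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
    v_new = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
    return m_new, v_new

def unrolled_momentum(grads, beta1=0.9):
    """Closed form (1 - beta1) * sum_k beta1**(t - k) * g_k, assuming m_0 = 0."""
    t = len(grads) - 1
    dim = len(grads[0])
    return [(1.0 - beta1) * sum(beta1 ** (t - k) * grads[k][j] for k in range(t + 1))
            for j in range(dim)]

grads = [[1.0, -2.0], [0.5, 0.0], [-1.0, 3.0]]   # made-up gradient sequence
m, v = [0.0, 0.0], [0.0, 0.0]
for g in grads:
    m, v = update_moments(m, v, g)
closed = unrolled_momentum(grads)                # agrees with m up to rounding
```

The unrolled form makes the role of the decay factor visible: a gradient that is k steps old is weighted by beta1 to the power k, so older gradients fade geometrically.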
Then, a first-difference term carrying information about the future trend is added to correct the lagged gradient. The derivative of the gradient is approximated by the first difference $g_t^i - g_{t-1}^i$, which reflects the instantaneous variation of the gradient. It is incorporated into the algorithm to anticipate the future behavior of the model and avoid overshooting, playing a role similar to that of the derivative term in the PID controller. Furthermore, to mitigate the noise in the gradient computation caused by randomly selecting mini-batch data, we apply an exponentially weighted moving average to the derivative part, which yields the smoothed derivative term $d_{t+1}^i$. Finally, the local model parameter $x_{t+1}^i$ is updated by rule (5), which combines the momentum term $m_{t+1}^i$, the derivative term $d_{t+1}^i$, and the adaptive scaling $\hat{v}_{t+1}$, where η is the learning rate and ν is the step size of the derivative term. The term $\hat{v}_{t+1}$ is the element-wise maximum of $\hat{v}_t$ and the average of $v_{t+1}^i$ across the n clients, as shown in line 11 of Algorithm 1. Moreover, if t + 1 is a multiple of E, the central server averages the model parameters $x_{t+1}^i$ and the second moments $v_{t+1}^i$, where E is a positive constant denoting the number of local updates.

Algorithm 1 (FedADT), in outline: at each iteration, every client i = 1, 2, . . . , n in parallel computes its stochastic gradient and performs the local updates described above; if t + 1 is a multiple of E, the central server aggregates the client models and second moments.
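As a rough illustration of how these pieces fit together, the sketch below implements one local step and one server aggregation in plain Python. The exact update formulas are assumptions made for this example: the derivative term is smoothed as beta1 * d + (1 - beta1) * (g - g_prev), and the parameter update is x - (eta * m + nu * d) / sqrt(v_hat), element-wise. The paper's own rule (5) may arrange these terms differently. The function names and default values are hypothetical.

```python
import math

def fedadt_local_step(x, m, v, d, g, g_prev, v_hat,
                      eta=0.1, nu=0.01, beta1=0.9, beta2=0.99):
    """One local step of a FedADT-style update (assumed form, see the text above)."""
    m = [beta1 * a + (1 - beta1) * b for a, b in zip(m, g)]
    v = [beta2 * a + (1 - beta2) * b * b for a, b in zip(v, g)]
    # Smoothed first difference of gradients: the derivative-like term.
    d = [beta1 * a + (1 - beta1) * (b - c) for a, b, c in zip(d, g, g_prev)]
    # Momentum and derivative terms, both scaled element-wise by sqrt(v_hat).
    x = [xj - (eta * mj + nu * dj) / math.sqrt(vh)
         for xj, mj, dj, vh in zip(x, m, d, v_hat)]
    return x, m, v, d

def server_aggregate(xs, vs, v_hat):
    """Average client models and second moments; keep v_hat non-decreasing."""
    n = len(xs)
    x_avg = [sum(col) / n for col in zip(*xs)]
    v_avg = [sum(col) / n for col in zip(*vs)]
    v_hat = [max(a, b) for a, b in zip(v_hat, v_avg)]
    return x_avg, v_hat

# One step for a single client with a one-dimensional model.
x_new, m_new, v_new, d_new = fedadt_local_step(
    x=[1.0], m=[0.0], v=[0.0], d=[0.0], g=[2.0], g_prev=[0.0], v_hat=[4.0])
```

Taking the element-wise maximum in `server_aggregate` keeps the scaling term from shrinking between aggregations, which mirrors the non-decreasing property of $\hat{v}_t$ used in the convergence proof.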

Assumptions and Main Results
In this section, before providing the main results of Algorithm 1, we first state three assumptions as follows.

Assumption 1.
The loss function $f_i(x)$ is differentiable and L-smooth, where $L > 0$ is a constant; that is, for all $x, y \in \mathbb{R}^d$ and $i \in \{1, \ldots, n\}$, we have:
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\, \|x - y\|.$$
Assumption 1 states our requirements for the local objective functions and is common in non-convex problems [14,17,43]. Next, there are two different assumptions about the stochastic gradients.

Assumption 2.
The stochastic gradient $g_t^i$ has a bounded $\ell_\infty$ norm, i.e., for any $i \in \{1, \ldots, n\}$ and $t \in \{1, \ldots, T\}$:
$$\|g_t^i\|_\infty \le G_\infty,$$
where $G_\infty$ is a positive scalar.
Assumptions 2 and 3 are common in the analysis of adaptive-type methods [14,17]; they bound the noisy gradient estimate and the variance of the stochastic gradient, respectively.
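A standard consequence of L-smoothness, and the form in which Assumption 1 typically enters non-convex convergence proofs, is the quadratic upper bound below. This is a well-known fact about L-smooth functions rather than a statement taken from this paper; it also holds for the average f of the $f_i$, since an average of L-smooth functions is itself L-smooth.

```latex
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^2
\qquad \text{for all } x, y \in \mathbb{R}^d .
```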
From the above assumptions, we obtain the main convergence theorem for the FedADT algorithm.
, then for any T ≥ 16nL 2 dδ , we have: and $f^* = f(x^*)$ is the optimal value at the optimal point $x^*$.
We defer the proof of Theorem 1 to Appendix A.

Remark 1.
From Theorem 1, we can see that the convergence rate of FedADT mainly depends on the initial function value, the variance of the stochastic gradients, and the number of local updates. The terms involving $\beta_1$ arise from the use of momentum and the derivative term. Moreover, the coefficient $(1-\beta_1)^2/\beta_1^2$ contributed by the derivative in the last term is less than 1 whenever $\beta_1 > 1/2$; for example, $\beta_1 = 0.9$ gives approximately 0.012. In addition, the number of local updates E affects both the communication efficiency and the convergence upper bound, since local updates introduce bias into the descent directions. However, the terms containing E do not dominate the right-hand side of (10) when $E \le O((dT)^{1/4}/n^{3/4})$. In fact, if $T > nd$, we can simplify the upper bound in (10) and obtain the convergence rate $O(\sqrt{d}/\sqrt{nT})$ for the proposed algorithm, as shown in Corollary 1.
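To get a feel for the quantities in Remark 1, the snippet below evaluates the derivative coefficient (1 - beta1)^2 / beta1^2 and an order-of-magnitude threshold on E. Reading the threshold as (dT)^(1/4) / n^(3/4) with all constants dropped is an assumption of this sketch, and the function names and the sample values of d, T, and n are hypothetical.

```python
def derivative_coefficient(beta1):
    """Coefficient (1 - beta1)**2 / beta1**2 attached to the derivative term."""
    return (1.0 - beta1) ** 2 / beta1 ** 2

def local_step_threshold(d, T, n):
    """(d*T)**(1/4) / n**(3/4), ignoring the constant hidden in the O(.) notation."""
    return (d * T) ** 0.25 / n ** 0.75

coef = derivative_coefficient(0.9)   # beta1 value used in the experiments
threshold = local_step_threshold(d=10_000, T=10_000, n=16)
```

The coefficient equals 1 exactly at beta1 = 0.5 and falls quickly as beta1 approaches 1, so the derivative contribution to the bound is small for the usual momentum settings.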

Remark 2.
In addition, the worst case, in which the right-hand side of (10) becomes large because δ is small, will not occur. In fact, the term δ arises from the lower bound on $\hat{v}_t$, and by the update rule, $\hat{v}_t$ quickly becomes at least of the same order as the second moment of the stochastic gradients. Conversely, if the stochastic gradients are themselves small, their $\ell_\infty$ norm remains of the order of δ.
From Corollary 1, we can see that the convergence of the proposed algorithm is evaluated by 1 2 , which is exactly the lower bound of the term on the left-hand side of (10) by utilizing the inequality v t+1 ∞ ≤ G ∞ .

Experiments
In this section, we study the performance of the proposed algorithm on two standard datasets for the image classification task in a federated setting. The MNIST dataset [44] is a set of handwritten digits from 0 to 9, which belong to 10 different categories. It contains 60,000 training samples and 10,000 test samples. Each sample is a 28 × 28 pixel grayscale image of a handwritten digit, with white strokes on a black background. Fashion MNIST [45] is a clothing image dataset that contains 10 classes of items such as T-shirt, dress, and bag. It has the same numbers of training and test samples as the MNIST dataset; the details are summarized in Table 1. We evaluate FedADT on the two datasets by training a Convolutional Neural Network (CNN) as in [2], which consists of two convolution layers and two pooling layers followed by a fully connected layer, with more than a million parameters in total. In the experiments, we use 10 nodes and a central server to mimic the federated training setting. Each node trains a local model and uploads it to the central server periodically. The central server generates a global model by aggregating the local models. Here, the number of local updates is set to 5, and the local batch size is chosen from {5, 16, 64} for the best performance. We suppose that every node takes part in training in each communication round. We select the learning rate from {0.1, 0.01, 0.001, 0.00001, 0.00002} for the best performance, and set β1 = 0.9 and β2 = 0.99. For the baseline algorithms, we search the learning rate over the same set and set the number of local updates to 5 with a batch size of 16. For a fair comparison, every client uses the same neural network model, trained by the different algorithms.
In addition, we consider two ways of partitioning the data over nodes as in [2]: data homogeneity (IID setting), where the data are uniformly distributed across the 10 nodes, and data heterogeneity (Non-IID setting), where the data are sorted by label and then assigned to the 10 nodes. Both partitions are balanced. All experiments are performed on a workstation with two Intel(R) Xeon(R) Silver 4114 CPUs and two NVIDIA GeForce GTX 1080 Ti GPUs, and the algorithms are implemented in PyTorch, a popular deep learning library.
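The two partitioning schemes can be sketched on label indices alone. This is a simplified illustration under stated assumptions: the Non-IID split sorts samples by label and hands each node one contiguous block, whereas the shard scheme in [2] may assign several shards per client. The function names and the synthetic label list are hypothetical.

```python
import random

def partition_iid(labels, num_nodes, seed=0):
    """Shuffle sample indices and deal them out evenly across nodes."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    return [idx[k::num_nodes] for k in range(num_nodes)]

def partition_by_label(labels, num_nodes):
    """Sort indices by label and give each node one contiguous block.

    Any remainder after integer division is dropped to keep nodes balanced.
    """
    idx = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(idx) // num_nodes
    return [idx[k * size:(k + 1) * size] for k in range(num_nodes)]

labels = [i % 10 for i in range(1000)]   # synthetic labels, 100 samples per class
iid = partition_iid(labels, 10)
non_iid = partition_by_label(labels, 10)
```

With 10 balanced classes and 10 nodes, each node in the label-sorted split sees a single class, which is an extreme form of the heterogeneity studied in the experiments.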
We use two common metrics in federated learning, training loss and test accuracy, and plot the learning curves against the number of communication rounds to compare the methods. We run 1000 rounds on the two datasets to compare FedADT with the naive local PID and local SGD methods, which use PID [21] and SGD, respectively, as the local optimizer on each node. The results are illustrated in Figure 2, which shows the loss curves of the three federated optimization algorithms on the MNIST and Fashion MNIST datasets. The proposed FedADT method consistently converges faster than the two baseline methods under both the IID and Non-IID settings on the two datasets. In the bottom row of Figure 2, where the data are heterogeneous across nodes, the learning curves of naive local PID and local SGD oscillate and are unstable, which slows convergence, so the global models require more training rounds to reach the desired results. A reasonable explanation for the more stable behavior of FedADT is that it uses the differential term in the local descent direction, as well as the adaptive learning rate in the model update.
Figure 3 shows the advantage of FedADT in terms of test accuracy under different data distributions on the MNIST and Fashion MNIST datasets. As expected, all the algorithms achieve similar accuracy under the IID setting on both datasets. Compared with the baseline methods, the proposed algorithm is the first to reach the highest accuracy, 99.13% and 91.98% on the two datasets, respectively. Under data heterogeneity, as shown in the bottom row of Figure 3, the best accuracy of FedADT on the MNIST dataset is 4.41 and 12.25 percentage points higher than that of naive local PID (94.37%) and local SGD (86.53%), respectively. On the Fashion MNIST dataset, the best accuracy of FedADT (84.89%) is 10.87 and 12.65 percentage points higher than that of naive local PID (74.02%) and local SGD (72.24%), respectively.
We use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to evaluate the quality of the proposed algorithm. The experiment is performed by training the convolutional neural network on the Fashion MNIST dataset with Non-IID data, and the result is shown in Figure 4, which displays the ROC curve and the AUC for each of the 10 clothing classes. Class 1 is identified best, with the largest AUC of 0.999; classes 9 and 5 achieve similar results, whereas class 6 is recognized least well, with the smallest AUC of 0.950.
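A per-class AUC of the kind reported above is commonly computed one-vs-rest: for each class, its predicted probability is used as a score for separating that class from all others. The sketch below is a minimal pure-Python version of that idea, using the pairwise-ranking definition of AUC. It is an illustration with made-up scores, not the evaluation code behind Figure 4, and the function names are hypothetical.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, labels, num_classes):
    """Per-class AUC, treating class c as positive and all other classes as negative."""
    return [binary_auc([row[c] for row in prob_rows], [y == c for y in labels])
            for c in range(num_classes)]

# Made-up predicted probabilities for four samples and three classes.
probs = [[0.8, 0.1, 0.1],
         [0.2, 0.7, 0.1],
         [0.3, 0.3, 0.4],
         [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]
aucs = one_vs_rest_auc(probs, labels, num_classes=3)
```

The pairwise loop costs time proportional to the number of positive-negative pairs, which is fine for a small illustration; a sort-based rank computation is the usual choice for a full test set.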

Conclusions
In this paper, we focus on federated learning, where multiple nodes or clients jointly train a model without exchanging their private data. During local model training, the client side generally adopts a stochastic gradient-based algorithm, which is sensitive to the step size and suffers from slow convergence. Inspired by control theory, we propose a federated adaptive learning method based on a derivative term that plays a role similar to that in the PID controller. We use the first difference of the gradient to estimate the derivative term, which reflects the instantaneous variation of the gradient and stands for future information. We provide a convergence guarantee for the proposed algorithm; in particular, when $E \le O((dT)^{1/4}/n^{3/4})$ and $T > nd$, a convergence rate of $O(1/\sqrt{nT})$ is achieved for non-convex objective functions. Finally, experiments are performed under different data distributions on the MNIST and Fashion MNIST datasets. Specifically, on the Fashion MNIST dataset, the training loss of the proposed algorithm declines fastest among the compared methods, and the highest accuracies of 91.98% and 84.89% are attained in the IID and Non-IID cases, respectively. Similar results are obtained on the MNIST dataset, which empirically verifies the effectiveness of the proposed algorithm.
In addition, the ROC curve is used to display the results of predicting the categories of clothing on the Fashion MNIST dataset. In future work, we will consider communication-efficient mechanisms that do not sacrifice the convergence rate. Privacy protection in the communication between clients and the central server should also be considered.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem
In this section, we analyze the convergence of the proposed algorithm following the method [9,17,18] which helps to handle the stochastic momentum term. Along this line, a virtual sequence is often useful for theoretical analysis.

Appendix A.1. Main Proof of Theorem
Before providing the main proof of Theorem 1, an auxiliary sequence $y_t$ is introduced as follows: where $x_t$ and $g_t$ are the averages of the local models $x_t^i$ and the stochastic gradients $g_t^i$ across the n clients in Algorithm 1, respectively, and we define $x_0 = x_{-1}$, $g_{-1} = 0$. Similarly, we define the average vectors $m_t = \frac{1}{n} \sum_{i=1}^{n} m_t^i$ and $d_t = \frac{1}{n} \sum_{i=1}^{n} d_t^i$. Applying the update rule of Algorithm 1 implies that: In order to prove Theorem 1, the following lemmas are needed. We defer their proofs to Appendix A.2.
Lemma A1. For the sequence $y_t$ defined in (A1), we have: . (A5) From Lemma A1, we connect $y_{t+1} - y_t$ with the two terms on the right-hand side of (A5). The following two lemmas give bounds on the distance $\|y_t - x_t\|^2$ and 1 respectively.
Lemma A2. For the sequence $y_t$ and the average model $x_t$, we have:
Lemma A3. For $x_t$ defined in (A4) and the sequence of iterates $x_t^i$ generated by Algorithm 1, for t ≥ 1, we have: Following Lemma A3, the corresponding average drift bound for 1 is derived in the following lemma.
Lemma A4. Under Algorithm 1, the following relations hold: From Lemma A4, we can provide bounds on $\|y_{t+1} - y_t\|^2$, which play an important role in the proof of the theorem. Finally, with the previous lemmas, we return to the proof of Theorem 1.
Proof of Theorem 1. By the smoothness of f, for t ≥ 0, we have: Summing over t from 0 to T − 1 and applying Lemma A1, we have: is used from Assumption 3. We next bound the terms on the right-hand side of (A11). By (A27) and the non-decreasing property of $\hat{v}_t$, we have: Furthermore, we bound the last term of (A13). By the smoothness of $f_i(x)$, together with (A6) and (A7), we have: Plugging (A8), (A9), (A12) and (A14) into (A11), we obtain: Dividing both sides of (A15) by ηT and rearranging the terms, we have: (A17) Plugging (A17) into (A16), together with T ≥ 16nL 2 dδ and $\nu = 1/\sqrt{Td}$, we have: Further, by the Cauchy-Schwarz inequality, we have: where the second inequality follows from Jensen's inequality, and the last from the smoothness of $f_i$ and Lemma A3. Plugging (A19) into (A18) and multiplying both sides of this inequality by 8 yields: where $y_0 = x_0$ and $f^* = f(x^*)$ is the minimum value at the optimal point $x^*$. Hence, we complete the proof.

Appendix A.2. Proof of Lemmas
Proof of Lemma A1. From the definition of y t , we have: According to (A2) and (A3), we have: which finishes the proof.
Proof of Lemma A2. Recalling (A1) and (A4), we have: Iteratively applying the recursion in Equation (A3), together with $d_0 = 0$, we obtain: We then expand $d_t$ in (A22), noting that $g_{-1} = 0$: For t ≥ 1, substituting (A23) into (A21) yields: Furthermore, we measure the difference between $y_t$ and $x_t$. Using the inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ implies that: (A25) We first estimate $\|m_t^i\|_\infty$ by induction. In fact, by Assumption 2, the stochastic gradient has a bounded $\ell_\infty$ norm, and we have: where d is the dimensionality of the vectors. For $m_t^i$, since $\|m_0^i\|_\infty = 0$, suppose that $\|m_t^i\|_\infty \le G_\infty$; then: Thus, according to the definition of the norm, we have: where the subscript j denotes the j-th component of a vector. The first inequality follows from the non-decreasing property of $\hat{v}_t$ and $(\hat{v}_t)_j \ge \delta$ for any $j \in \{1, \cdots, d\}$.
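The elementary facts used in this proof can be written out explicitly. The induction step below assumes that the momentum follows the standard recursion $m_{t+1}^i = \beta_1 m_t^i + (1-\beta_1) g_t^i$ with $m_0^i = 0$, which matches the description in the algorithm section but is restated here as an assumption.

```latex
% Norm inequality, from 2<a,b> <= ||a||^2 + ||b||^2:
\|a + b\|^2 = \|a\|^2 + 2\langle a, b\rangle + \|b\|^2 \le 2\|a\|^2 + 2\|b\|^2

% From the l_infinity bound of Assumption 2 to an l_2 bound in dimension d:
\|g_t^i\| \le \sqrt{d}\,\|g_t^i\|_\infty \le \sqrt{d}\, G_\infty

% Induction step for the momentum, given \|m_t^i\|_\infty \le G_\infty:
\|m_{t+1}^i\|_\infty \le \beta_1 \|m_t^i\|_\infty + (1-\beta_1)\,\|g_t^i\|_\infty
  \le \beta_1 G_\infty + (1-\beta_1)\, G_\infty = G_\infty
```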
Proof of Lemma A3. When t is an aggregation moment, i.e., a multiple of E, it is easy to see that $x_t - x_t^i = 0$. Thus, we focus on bounding $x_t - x_t^i$ when t is not a multiple of E. Let $t_0 < t$ be the largest iteration index that is an aggregation moment, so that $t - t_0 < E$. According to the update rule (5), we have: Summing over $i \in \{1, \cdots, n\}$, we further obtain:
Proof of Lemma A4. According to (A5), it is easy to see that: (A36) Now, we estimate the two terms on the right-hand side of (A36). Applying the definition of the norm implies that: